Statistical significance of ungapped sequence alignments.

Authors

  • N N Alexandrov
  • V V Solovyev
Abstract

The statistical significance of a local sequence alignment depends not only on the similarity score and the sequence lengths, but also on the length of the alignment. The dependence of alignment significance on sequence length has been analyzed earlier and rests on the idea that longer sequences have more chances to share a local similarity with a higher score. To the best of our knowledge, the dependence of statistical significance on alignment length has not been used in selecting the best alignments. We have applied formulas for assessing the statistical significance of ungapped local alignments to real proteins. Let $L$ be the length of the alignment; then the expected value of the similarity score is $S_{\mathrm{exp}} = \langle m \rangle L$, where $\langle m \rangle$ is the expected similarity between two randomly chosen residues. The value of $\langle m \rangle$ can be calculated from a similarity (substitution) matrix $M$ and amino acid frequencies $P$: $\langle m \rangle = \sum_{ij} p_i p_j m_{ij}$. The probability of observing a score $S$ greater than or equal to $x$ for an alignment of length $L$ is given by the normal distribution: $\mathrm{Prob}(S \ge x) = 1 - \int N\left(\frac{S - S_{\mathrm{exp}}}{\sigma}\right) = 1 - \int N\left(\frac{S - \langle m \rangle L}{\sigma_m \sqrt{L}}\right)$, where $\sigma_m$ is the standard deviation of $m$. From these formulas we conclude that the best alignment should be selected using a normalized value of the similarity score: $S' = \max\left\{\frac{S - \langle m \rangle L}{\sigma_m \sqrt{L}}\right\}$. The proposed normalization of the similarity score has been tested on a representative benchmark. To evaluate the performance of the normalization, we calculated several measures of recognition quality, and the normalization improved all of them. This procedure is important for choosing the correct alignment for homology modelling as well as for selecting distantly related sequences in databases.
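The normalization described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the two-letter alphabet, and the toy scores are hypothetical, and the standard deviation of a length-L score is taken as the per-residue standard deviation times the square root of L, as the abstract's formula implies for independent positions.

```python
import math

def residue_pair_stats(freqs, matrix):
    """Mean and standard deviation of the substitution score m for two
    residues drawn independently from the frequencies in `freqs`."""
    mean = sum(freqs[a] * freqs[b] * matrix[a][b] for a in freqs for b in freqs)
    second = sum(freqs[a] * freqs[b] * matrix[a][b] ** 2 for a in freqs for b in freqs)
    return mean, math.sqrt(second - mean ** 2)

def normalized_score(score, length, mean, std):
    """(S - mean * L) / (std * sqrt(L)) for an ungapped alignment of length L."""
    return (score - mean * length) / (std * math.sqrt(length))

def tail_probability(score, length, mean, std):
    """Upper tail of the standard normal at the normalized score."""
    z = normalized_score(score, length, mean, std)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def best_alignment(candidates, mean, std):
    """Pick the (score, length) pair with the largest normalized score."""
    return max(candidates, key=lambda c: normalized_score(c[0], c[1], mean, std))

# Hypothetical toy data: two equally frequent residues, +1 match, -1 mismatch.
freqs = {"A": 0.5, "B": 0.5}
matrix = {"A": {"A": 1, "B": -1}, "B": {"A": -1, "B": 1}}
mean, std = residue_pair_stats(freqs, matrix)

# Two candidate ungapped alignments given as (raw score, length).
candidates = [(10, 25), (12, 64)]
best = best_alignment(candidates, mean, std)
```

With these toy values the shorter alignment is selected (normalized score 2.0 versus 1.5) even though its raw score is lower, which is the effect of alignment length that the abstract describes.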

Similar resources

Statistical Significance for NGS Reads Similarities

In this work we present a significance curve to segregate random alignments from true matches in by-identity sequence comparison, especially suitable for sequencing data produced by NGS technologies. The experimental approach reproduces the distribution of random local ungapped similarities by score and length, from which it is possible to assess the statistical significance of any particular ungapp...

Estimating Pairwise Statistical Significance of Protein Local Alignments Using a Clustering-Classification Approach Based on Amino Acid Composition

A central question in pairwise sequence comparison is assessing the statistical significance of the alignment. The alignment score distribution is known to follow an extreme value distribution with analytically calculable parameters K and λ for ungapped alignments with one substitution matrix. But no statistical theory is currently available for the gapped case and for alignments using multiple...

Generalized affine gap costs for protein sequence alignment.

Based on the observation that a single mutational event can delete or insert multiple residues, affine gap costs for sequence alignment charge a penalty for the existence of a gap, and a further length-dependent penalty. From structural or multiple alignments of distantly related proteins, it has been observed that conserved residues frequently fall into ungapped blocks separated by relatively ...

Efficient large-scale sequence comparison by locality-sensitive hashing

MOTIVATION Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with ...

ProbeMatch: rapid alignment of oligonucleotides to genome allowing both gaps and mismatches

SUMMARY We have developed a tool, called ProbeMatch, for matching a large set of oligonucleotide sequences against a genome database using gapped alignments. Unlike most of the existing tools such as ELAND which only perform ungapped alignments allowing at most two mismatches, ProbeMatch generates both ungapped and gapped alignments allowing up to three errors including insertion, deletion and ...

Journal:
  • Pacific Symposium on Biocomputing

Publication date: 1998